In this week’s exercise we will look at clustering and classification with the Boston dataset from the MASS package. Let’s begin by importing the data.
initsetup <- function(k){
setwd(k)
library(Matrix)
library(ggplot2)
library(dplyr)
library(MASS)
library(GGally)
}
initsetup("/Users/veikko/Desktop/datakurssi/IODS-project/data")
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project/data"
rm(list=ls())
data("Boston")
# summaries of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The str and summary functions reveal that we have one binary variable (chas) and one discrete-valued variable (rad). The variables black, crim and zn have very large gaps between their minimum and maximum values, but from the 1st to the 3rd quartile their values look believable. One can also note that the variables are all on different scales.
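The differing scales can be made concrete by looking at each variable’s standard deviation; a quick sketch:

```r
library(MASS)
data("Boston")
# the standard deviations vary by orders of magnitude across columns,
# e.g. tax is in the hundreds while nox stays well below 1
sapply(Boston, sd)
```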
Let’s explore the distributions and correlations of the variables in the data set using the ggpairs function.
# there are at least two categorical variables in the data
B1 <- Boston
B1$rad <- as.factor(Boston$rad)
B1$chas <- as.factor(ifelse(Boston$chas==1, "river", "land"))
# custom ggpairs from https://github.com/ggobi/ggally/issues/139
library(RColorBrewer)
corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
corColors
## [1] "#FC8D59" "#FEE08B" "#FFFFBF" "#D9EF8B" "#91CF60"
my_custom_cor_color <- function(data, mapping, color = I("black"), sizeRange = c(1, 5), ...) {
# get the x and y data to use the other code
x <- eval(mapping$x, data)
y <- eval(mapping$y, data)
ct <- cor.test(x,y)
r <- unname(ct$estimate)
rt <- format(r, digits=2)[1]
tt <- as.character(rt)
# plot the cor value
p <- ggally_text(
label = tt,
mapping = aes(),
xP = 0.5, yP = 0.5,
size = 6,
color=color,
...
) +
theme(
panel.background=element_rect(fill="white", color = "black"),
panel.grid.minor=element_blank(),
panel.grid.major=element_blank()
)
corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
if (r <= -0.8) {
corCol <- corColors[1]
} else if (r <= -0.6) {
corCol <- corColors[2]
} else if (r < 0.6) {
corCol <- corColors[3]
} else if (r < 0.8) {
corCol <- corColors[4]
} else {
corCol <- corColors[5]
}
p <- p + theme(
panel.background = element_rect(fill= corCol)
)
p
}
ggpairs(
B1,
columns = 1:14,
upper = list(continuous = my_custom_cor_color),
diag = list(combo = "facethist"),
lower = list(
combo = wrap("facethist", bins = 100)
)
)
Here the correlations are much easier to see than with the default ggpairs settings. With the defaults there are so many variables that the plot is difficult to interpret without any color coding. Keep in mind that I did not test the correlations for statistical significance.
Some insights from the ggpairs:
If we look at the medv variable, which indicates the median value of homes, we can see that it correlates strongly negatively with crim and lstat. This means that the more valuable the homes in an area, the less crime and the smaller the share of lower-status population. One can also see that age and lstat correlate positively with each other, which suggests that lower social status is connected with older buildings in the neighbourhood. There is also a very strong positive correlation between tax and indus, which indicates that high property tax rates occur in areas with high levels of industrial activity.
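As a side note, an individual correlation could be tested for significance with cor.test; a minimal sketch:

```r
library(MASS)
data("Boston")
# test whether the medv-lstat correlation differs from zero
ct <- cor.test(Boston$medv, Boston$lstat)
ct$estimate  # the correlation coefficient
ct$p.value   # the p-value of the test
```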
# x-mean(x) / sd(x)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
# creating a factor variable
scaled_crim <- boston_scaled[, "crim"]
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# bins <- quantile(scaled_crim)
crime <- as.numeric(cut_number(scaled_crim,4)) %>%
factor(labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove old and add new
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Above I did several things: 1. scaled all variables in the Boston dataset (note that all variables need to be numerical), 2. summarized the new scaled dataset, 3. converted the scaled data back to a data frame, 4. cut scaled_crim at its quantiles to create a new factor variable crime, and 5. removed the old numeric variable and replaced it with the categorical one.
As we can see, the variables are now all standardized, meaning each value is transformed as \((x - mean(x))/sd(x)\). Values below the mean are therefore negative, the mean itself maps to zero, and values above the mean are positive.
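The standardization done by scale can be verified column by column against the formula above; a small sketch:

```r
library(MASS)
data("Boston")
x <- Boston$crim
# scale() centers by the mean and divides by the standard deviation
all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```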
Next, I will divide the data into a training set and a test set, so that 80% of the observations go to the training set. I’ll also remove the true values of the crime variable from the test set into a separate variable called correct_classes.
# creating training and test set
set.seed(12)
ind <- nrow(boston_scaled) %>%
sample(. * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test[, "crime"]
# test set ready for prediction
test <- dplyr::select(test, -crime)
Now we turn to fitting a linear discriminant analysis on the training set. We will also visualize the results.
# fitting the model on the training set
lda.fit <- lda(crime ~. , data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2574257 0.2400990 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.94307637 -0.9127350 -0.08661679 -0.8744482 0.4412043
## med_low -0.09454334 -0.3418261 0.03052480 -0.5929899 -0.1265458
## med_high -0.37938430 0.1884615 0.25532354 0.3595668 0.1475993
## high -0.48724019 1.0172187 -0.06938576 1.0430995 -0.4440326
## age dis rad tax ptratio
## low -0.9260201 0.8575468 -0.6904206 -0.7365904 -0.45879651
## med_low -0.3434658 0.3686088 -0.5478832 -0.5261717 -0.09316052
## med_high 0.4063941 -0.3494648 -0.4029017 -0.3114126 -0.30231831
## high 0.8250080 -0.8654942 1.6371072 1.5133254 0.77958792
## black lstat medv
## low 0.37693462 -0.76152029 0.54562143
## med_low 0.31593747 -0.13281609 0.00270585
## med_high 0.07074154 -0.02403283 0.19304349
## high -0.80338538 0.95065549 -0.70280102
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07707055 0.737746735 -0.87329135
## indus 0.06895628 -0.349487786 0.25520750
## chas -0.09673761 -0.051237460 0.09856258
## nox 0.40432895 -0.593237949 -1.47606017
## rm -0.09136114 -0.087050917 -0.12198778
## age 0.23344441 -0.395445903 -0.04306557
## dis -0.03332099 -0.326460028 0.06214625
## rad 3.09520048 0.944267182 0.13844589
## tax 0.04398345 -0.022798137 0.35384722
## ptratio 0.08659163 0.084704616 -0.30507815
## black -0.11029035 0.009633805 0.14947037
## lstat 0.27733959 -0.230885707 0.33364507
## medv 0.23149514 -0.353311333 -0.30490501
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9486 0.0384 0.0129
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
Here we perform the prediction with the trained LDA model. Then we cross tabulate the predicted classes against the correct classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 8 1 0
## med_low 3 12 7 0
## med_high 0 9 19 1
## high 0 0 0 30
As we can see, our LDA prediction was good at the extreme classes, i.e. high and low crime rates: of the observations predicted as high only one actually belonged to another class, and of those predicted as low only three did. The med_low and med_high classes were mixed up more often.
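For a single summary number, the overall accuracy can be read off the cross tabulation; a sketch, recomputing the table from the objects above:

```r
# proportion of test observations on the diagonal, i.e. predicted correctly
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)  # (12 + 12 + 19 + 30) / 102, about 0.72
```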
Let’s run the k-means clustering algorithm on the dataset. For this procedure we need to calculate distances between observations, and the dataset needs to be scaled so that the distances are comparable. Then we determine the optimal number of clusters by visually inspecting the total within-cluster sum of squares for different values of k. After that we visualize the dataset by coloring the observations by their cluster.
data('Boston')
B2 <- scale(Boston)
B2 <- as.data.frame(B2)
dist_eu <- dist(B2, method = "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
dist_man <- dist(B2, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
# let's determine the optimal number of clusters
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
We can clearly see that the greatest drop in the plot happens with k=2. So, we choose k=2 to be the optimal number of clusters.
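Note that kmeans starts from random centers, so the elbow curve can vary a little between runs; fixing the seed and using several random starts (the nstart argument) makes it more stable. A sketch under the same setup, assuming dist_eu from the chunk above:

```r
set.seed(123)
k_max <- 10
# nstart restarts k-means from several random configurations and keeps the best
twcss <- sapply(1:k_max, function(k){
  kmeans(dist_eu, centers = k, nstart = 10)$tot.withinss
})
plot(1:k_max, twcss, type = 'b')
```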
# creating plots to interpret results
km <-kmeans(dist_eu, centers = 2)
asd <- as.data.frame(km$cluster) %>%
.[, 1] %>%
as.factor(.)
Boston["clusters"] <- asd
ggpairs(Boston, mapping = aes(col = clusters, alpha = 0.3), columns = 1:14,
lower = list(combo = wrap("facethist", bins = 100)))
# histograms
library(reshape2)
d <- melt(Boston[,-c(4)])
ggplot(d,aes(x = value, fill=clusters)) +
facet_wrap(~variable,scales = "free_x") +
geom_histogram()
It seems that low-crime observations almost always fall in the blue cluster and high-crime observations in the red one, across all variables. In some dimensions the clusters are somewhat mixed, but in others there is a clear division between them.
Next, we can move onto the first BONUS task.
# dist_eu is calculated from scaled Boston dataset called B2
km <-kmeans(dist_eu, centers = 4)
asd <- as.data.frame(km$cluster) %>%
.[, 1] %>%
as.factor(.)
B2["clusters"] <- asd
lda.fit2 <- lda(clusters ~., data = B2)
lda.fit2
## Call:
## lda(clusters ~ ., data = B2)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.4308300 0.1304348 0.2272727 0.2114625
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 2 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469
## 3 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011
## 4 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426
## age dis rad tax ptratio black
## 1 -0.3256000 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644
## 2 0.8575386 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038
## 3 0.7716696 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546
## 4 -0.6949417 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729
## lstat medv
## 1 -0.2813666 -0.01314324
## 2 1.1930953 -0.81904111
## 3 0.6116223 -0.54636549
## 4 -0.8200275 1.11919598
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.18113078 0.5012256 0.60535205
## zn 0.43297497 1.0486194 -0.67406151
## indus 1.37753200 -0.3016928 -1.07034034
## chas -0.04307937 0.7598229 0.22448239
## nox 1.04674638 0.3861005 0.33268952
## rm -0.14912869 0.1510367 -0.67942589
## age -0.09897424 -0.0523110 -0.26285587
## dis 0.13139210 0.1593367 0.03487882
## rad 0.65824136 -0.5189795 -0.48145070
## tax 0.28903561 0.5773959 -0.10350513
## ptratio 0.22236843 -0.1668597 0.09181715
## black -0.42730704 -0.5843973 -0.89869354
## lstat 0.24320629 0.6197780 0.01119242
## medv 0.21961575 0.9485829 0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
# target classes as numeric
classes <- as.numeric(B2$clusters)
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 6)
This looks interesting. The biplot shows which variables pull observations towards certain clusters. For instance the crim, nox and medv variables pull observations towards cluster 2, rad towards cluster 3, and black towards cluster 1.
Finally, let’s have a try at the Super-Bonus task.
# super bonus
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = Boston$clusters[ind])
The two plots group the observations in similar ways. It seems that high and med_high are mostly clustered together as cluster 1 and low and med_low as cluster 2. There are some misclassifications (as seen earlier in the cross tabulation), and this kind of plot makes it easy to spot them individually.
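One way to highlight the misclassifications directly would be to color the points by whether the LDA prediction on the training set agrees with the true class; a sketch using the objects defined above (lda.fit, train, matrix_product):

```r
# predict on the training data and flag agreement with the true classes
train_pred <- predict(lda.fit)$class
agree <- ifelse(train_pred == train$crime, "correct", "misclassified")
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
        z = matrix_product$LD3, type = 'scatter3d', mode = 'markers',
        color = agree)
```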
I started off this week’s exercise with data wrangling. I read the data from UCI Machine Learning Repository.
initialsetup <- function(projectWD){
setwd(projectWD)
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
}
k <- "/Users/veikko/Desktop/datakurssi/IODS-project/data"
initialsetup(k)
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project/data"
math <- read.csv("student-mat.csv", header = T, sep = ";")
por <- read.csv("student-por.csv", header = T, sep = ";")
dim(math)
## [1] 395 33
dim(por)
## [1] 649 33
colnames(math)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
head(math)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
After inspecting the data I continued by combining the math and por datasets using a predetermined set of identifier variables: “school”, “sex”, “age”, “address”, “famsize”, “Pstatus”, “Medu”, “Fedu”, “Mjob”, “Fjob”, “reason”, “nursery” and “internet”.
join_by <- c("school","sex","age","address","famsize","Pstatus","Medu",
"Fedu","Mjob","Fjob","reason","nursery","internet")
#alternatively: math_por <- inner_join(math, por, by = join_by, suffix = c(".math", ".por"))
math_por <- math %>%
left_join(por, by = join_by, suffix = c(".math", ".por")) %>%
drop_na()
glimpse(math_por)
## Observations: 382
## Variables: 53
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob <fctr> teacher, other, other, services, other, other...
## $ reason <fctr> course, course, other, home, home, reputation...
## $ guardian.math <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime.math <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime.math <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures.math <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.math <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup.math <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid.math <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities.math <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ higher.math <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ romantic.math <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel.math <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime.math <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout.math <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc.math <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.math <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health.math <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences.math <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1.math <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2.math <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3.math <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ guardian.por <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime.por <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime.por <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures.por <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.por <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup.por <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid.por <fctr> no, no, no, no, no, no, no, no, no, no, no, n...
## $ activities.por <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher.por <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic.por <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel.por <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime.por <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout.por <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc.por <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.por <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health.por <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences.por <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6...
## $ G1.por <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, ...
## $ G2.por <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12...
## $ G3.por <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13...
alc <- math_por %>%
select(one_of(join_by))
# the columns in the datasets which were not used for joining the data
notjoined_columns <- colnames(math)[!colnames(math) %in% join_by]
Now we get to the trickiest part of the data wrangling. From the combined dataset we select the variable pairs that share an original name but carry different suffixes; for numeric pairs we take the row-wise average of the two columns with rowMeans, and for the rest we keep the first column. I tried to implement this with my own code, but it turned out to be surprisingly demanding (maybe I just did not find the right function; if you have any ideas, feel free to comment). After this procedure I created two new variables, “alc_use” and “high_use”. Finally, I wrote the data into a csv file.
# for every column name not used for joining...
# copied from Data camp as I could not get my alternative solution for this to work
for(column_name in notjoined_columns) {
two_columns <- select(math_por, starts_with(column_name))
first_column <- select(two_columns, 1)[[1]]
if(is.numeric(first_column)) {
alc[column_name] <- round(rowMeans(two_columns))
} else {
alc[column_name] <- first_column
}
}
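One possible alternative to the loop above (an untested sketch, assuming math_por, join_by and notjoined_columns exist as defined earlier) picks the suffixed pairs by exact column name instead of starts_with:

```r
alc2 <- math_por[, join_by]
for (nm in notjoined_columns) {
  # select the .math/.por pair by their exact column names
  pair <- math_por[, paste0(nm, c(".math", ".por"))]
  # average numeric pairs row-wise, otherwise keep the first column
  alc2[[nm]] <- if (is.numeric(pair[[1]])) round(rowMeans(pair)) else pair[[1]]
}
```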
glimpse(alc)
## Observations: 382
## Variables: 33
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
alc <- alc %>%
mutate(alc_use = (Walc + Dalc)/2, high_use = alc_use > 2)
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
write.csv(alc, file = "alcohol_data.csv", row.names = F)
Now that the data looks as instructed (382 observations and 35 variables), we can move on to the analysis part.
As a brief recap, our dataset consists of 35 variables and 382 observations; see the previous part for more details on the data. Here I quickly illustrate what the “alc_use” variable looks like. Remember that high_use is TRUE when alc_use is greater than 2.
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 382 35
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar()
I chose these four as my explanatory variables:
Absences hypothesis: high absence should indicate high alcohol consumption (people do not go to classes drunk or hungover)
Family relations hypothesis: bad relations (low famrel) could increase the odds of high alcohol consumption
Health hypothesis: poor health may drive alcohol consumption, but might also be an effect of it
Going out hypothesis: alcoholism is connected to anti-social behavior at older ages, but for students of this age we should expect drinking to be a social act
Now let’s plot the data for the selected variables. For the sake of clarity I will classify the answers for each variable of interest as “low” or “high”. I will provide boxplots for the grouped variables and also scatterplots for the original unclassified values, using sex to label/group the observations on the plots.
repeter <- function(var_name){
# cut into two equal-sized bins and label them "low"/"high"
factor(as.numeric(cut_number(var_name, 2)), labels = c("low", "high"))
}
# change variable name and parameter to create factor levels
alc$goout_group <- repeter(alc$goout)
alc$abs_group <- repeter(alc$absences)
alc$famrel_group <- repeter(alc$famrel)
alc$health_group <- repeter(alc$health)
plotData <- alc %>%
select(goout_group, abs_group, famrel_group, health_group, sex, alc_use)
plotData2 <- alc %>%
select(goout, absences, famrel, health, sex, alc_use)
plotData %>%
gather(-alc_use, -sex, key = "var", value = "value") %>%
ggplot(aes(x = value, y = alc_use, color = sex)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free") + ylab("Alcohol Consumption") + xlab(" ") + ggtitle("Alcohol Consumption by 4 Variables")
plotData2 %>%
gather(-alc_use, -sex, key = "var", value = "value") %>%
ggplot(aes(x = value, y = alc_use, color = sex)) +
geom_point(position = "jitter") +
facet_wrap(~ var, scales = "free") + ylab("Alcohol Consumption") + xlab(" ") + ggtitle("Alcohol Consumption by 4 Variables")
#summarising the data
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age), mean_absences = mean(absences),
mean_health = mean(health), mean_goout = mean(goout),
mean_famrel = mean(famrel), mean_grade = mean(G3))
## Source: local data frame [4 x 9]
## Groups: sex [?]
##
## sex high_use count mean_age mean_absences mean_health mean_goout
## <fctr> <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 F FALSE 156 16.63462 4.224359 3.378205 2.955128
## 2 F TRUE 42 16.50000 6.785714 3.404762 3.357143
## 3 M FALSE 112 16.31250 2.982143 3.714286 2.714286
## 4 M TRUE 72 16.95833 6.125000 3.875000 3.930556
## # ... with 2 more variables: mean_famrel <dbl>, mean_grade <dbl>
From the plots we can see that gender is important in predicting alcohol usage. Men drink more, and this is visible in all plots, whereas females do not seem to be affected much by the different variables. Regarding our hypotheses, health does not seem to have a strong relation to alcohol consumption, but the other variables do. Poor family relations make men drink more. A high value of going out is strongly connected to increased alcohol consumption in men. Students who are often absent from class also tend to drink more. The summary table shows the corresponding numerical breakdown by sex and the high_use variable.
Now we can fit the logistic regression model for the high_use variable.
#logistic regression
m <- glm(high_use ~ absences + famrel + goout + health, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + famrel + goout + health,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7888 -0.7668 -0.5299 0.9199 2.3872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92507 0.70158 -4.169 3.06e-05 ***
## absences 0.07449 0.02150 3.465 0.000531 ***
## famrel -0.38409 0.13783 -2.787 0.005326 **
## goout 0.78580 0.12050 6.521 6.98e-11 ***
## health 0.17119 0.09200 1.861 0.062775 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 392.24 on 377 degrees of freedom
## AIC: 402.24
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) absences famrel goout health
## -2.92506735 0.07449141 -0.38409075 0.78580008 0.17118709
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.05366108 0.0130406 0.2058018
## absences 1.07733608 1.0339972 1.1265093
## famrel 0.68106961 0.5180869 0.8911229
## goout 2.19416174 1.7431850 2.7986922
## health 1.18671275 0.9932560 1.4258286
probabilities <- predict(m, type = "response")
alc <- alc %>%
mutate(probability = probabilities) %>%
mutate(prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 246 22
## TRUE 69 45
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point(position = "jitter")
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64397906 0.05759162 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.82460733 0.17539267 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
As we interpret the output, we can see that the odds ratio for the health variable ranges from 0.9933 to 1.4258 (95% confidence interval). Since this interval contains 1, the underlying coefficient could be either negative (reducing the odds of high_use) or positive, so the variable is not statistically significant. Family relations has a negative relationship with high_use, while the rest have a positive one.
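The link between a coefficient's sign and its odds ratio is plain exponentiation; a quick numeric sanity check (reproducing the health odds ratio printed above) shows why an interval containing 1 leaves the direction of the effect open:

```r
# OR = exp(beta): beta = 0 maps exactly to OR = 1, so an OR confidence
# interval that contains 1 is the same statement as a coefficient interval
# that contains 0 -- the effect's direction is uncertain.
exp(0)           # exactly 1: "no effect" on the odds scale
exp(0.17118709)  # ~1.1867, reproducing the health odds ratio from coef(m)
```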
After this I calculated the cross-tabulation, where we can see the numbers of correctly and incorrectly classified instances. I then defined the loss function and computed the prediction error (using the whole data as a training set). The model’s training error is 0.238, so its accuracy is 76.2%. This is significantly better than a random classifier, which is right on average 50% of the time.
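A fairer yardstick than a coin flip is the majority-class baseline: always predicting FALSE errs exactly on the share of high users (about 0.298), which the model’s 0.238 still beats. A small sketch with the same loss function on a toy vector (a made-up stand-in for alc$high_use) makes the point:

```r
# Same loss function as above, applied to a constant "never high_use" guess.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
toy_class <- c(rep(1, 3), rep(0, 7))  # toy stand-in: 30% TRUE, like the data
loss_func(toy_class, prob = 0)        # 0.3: error of always predicting FALSE
```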
Finally, the BONUS task: trying to find a better model and evaluating it with 10-fold cross-validation.
#health is not statistically significant, so we replace it with sex
m <- glm(high_use ~ absences + famrel + goout + sex, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + famrel + goout + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7151 -0.7820 -0.5137 0.7537 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
## absences 0.08168 0.02200 3.713 0.000205 ***
## famrel -0.39378 0.14035 -2.806 0.005020 **
## goout 0.76761 0.12316 6.232 4.59e-10 ***
## sexM 1.01234 0.25895 3.909 9.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.81 on 377 degrees of freedom
## AIC: 389.81
##
## Number of Fisher Scoring iterations: 4
#cross-validation
library(boot)
cv <- alc %>%
cv.glm(cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241
Here we can see that the cross-validated error (about 0.209) is clearly better than that of the DataCamp model, which was about 0.26.
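To make clear what cv.glm is doing, here is a hand-rolled sketch of 10-fold cross-validation on synthetic data (the real call uses alc and the fitted model m; the variables x, y and d below are made up for illustration):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))  # synthetic binary outcome
d <- data.frame(x = x, y = y)

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

K <- 10
folds <- sample(rep(1:K, length.out = n))   # random fold assignment
errs <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = d[folds != k, ], family = "binomial")
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  loss_func(d$y[folds == k], p)             # error on the held-out fold
})
mean(errs)  # average test-fold error, analogous to cv$delta[1]
```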
First, I started with data wrangling exercise. I read the data from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt.
library(Matrix)
library(ggplot2)
library(dplyr)
setwd("/Users/veikko/Desktop/datakurssi/IODS-project/data")
lrn14 <- read.table("JYTOPKYS3-data.txt", header = T, sep = "\t")
str(lrn14)
## 'data.frame': 183 obs. of 60 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
dim(lrn14)
## [1] 183 60
Then I modified the dataset. Following the instructions, I created the deep, stra and surf variables by averaging related sets of questions in the learning2014 data.
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
# select the columns related to deep learning and create column 'deep' by averaging
deep_columns <- select(lrn14, one_of(deep_questions))
lrn14$deep <- rowMeans(deep_columns)
# select the columns related to surface learning and create column 'surf' by averaging
surface_columns <- select(lrn14, one_of(surface_questions))
lrn14$surf <- rowMeans(surface_columns)
# select the columns related to strategic learning and create column 'stra' by averaging
strategic_columns <- select(lrn14, one_of(strategic_questions))
lrn14$stra <- rowMeans(strategic_columns)
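The combination step is just a row mean over the selected columns; a toy sketch (with made-up answers to three of the deep questions):

```r
# Each combination score is the per-student average of its component items.
toy <- data.frame(D03 = c(4, 2), D11 = c(3, 4), D19 = c(5, 3))
rowMeans(toy)  # 4 and 3: row-wise averages, as in lrn14$deep above
```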
Then I scaled all combination variables to the original scale and finally, excluded observations where the exam points variable is zero. My final list of variables was then: gender, age, attitude, deep, stra, surf and points.
colnames(lrn14)[57] <- "age"
colnames(lrn14)[58] <- "attitude"
colnames(lrn14)[59] <- "points"
varsToGet <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
new_lrn14 <- select(lrn14, one_of(varsToGet))
new_lrn14 <- filter(new_lrn14, points > 0)
After this, I changed my working directory to the IODS project folder and saved the analysis dataset to the folder as a csv-file.
setwd("/Users/veikko/Desktop/datakurssi/IODS-project")
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project"
write.csv(new_lrn14, file = "learning2014.csv") # write.csv always uses comma separators, so no sep argument is needed
Then I wanted to check that everything went as planned, so I read the just created csv-file.
learning2014 <- read.csv("learning2014.csv", sep = ",", header = T)
learning2014 <- learning2014[-1]
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
#Almost forgot to scale the attitude variable
learning2014$attitude <- learning2014$attitude / 10
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Just to briefly repeat, the dataset consists of seven variables: gender, age, attitude, deep, stra, surf and points. In this analysis points is the dependent variable.
I have already provided, in the previous section, a summary of the variables showing the min, max, mean, median and quartile information for each variable. Next, I will show three graphical illustrations of the data. First, an overview of the data with the ggpairs function, with observations grouped by gender. Then a histogram of the points variable. Finally, I will group the data by exam performance into three groups (as close to equal size as possible) and provide a ggpairs plot with that grouping.
library(GGally)
#first ggpairs
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
#drawing a histogram of points
#use a factor over the full 7-33 range so empty scores show as zero bars
f <- factor(learning2014$points, levels = 7:33)
barplot(table(f), xlab = "Points", ylab = "Count")
#splitting the data into 3 groups based on points variable
learning2014$pgroup <- as.numeric(cut_number(learning2014$points,3))
learning2014$pgroup2 <- factor(learning2014$pgroup, labels = c("low", "middle", "high"))
table(learning2014$pgroup2)
##
## low middle high
## 68 45 53
#drawing a second ggpairs
test.learn <- subset(learning2014, select = c("age","attitude", "deep", "stra", "surf", "points", "pgroup2"))
ggpairs(test.learn, mapping = aes(col = pgroup2, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
We can see many interesting things in the plots drawn above. First of all, there is a gender-based difference: men are doing better in terms of exam points than women, but this would require a statistical test to determine whether the difference is statistically significant. From the first ggpairs matrix, one can also see the correlation coefficients and that attitude has the highest correlation with points: high attitude correlates with high exam points. Similarly, the stra variable has a positive correlation with points. Surf, deep and age, on the other hand, seem to be negatively correlated with points. The correlation coefficients differ between men and women, but not by much if we just look at points versus the other variables (for the same variable, the coefficients are within a 0.1 range between genders).
From the two other plots we can see how the points are distributed. As an extra I grouped the data by dividing points results into three categories “low”, “middle” and “high”. In the second ggpairs matrix one can see that there is much more variation between correlation coefficients of different groups than in comparison to grouping by gender.
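The equal-count grouping above relies on ggplot2’s cut_number; the same idea can be sketched in base R by cutting at empirical tertiles (the pts vector below is a made-up stand-in for learning2014$points):

```r
pts <- c(7, 12, 15, 18, 20, 22, 24, 26, 28, 30, 31, 33)     # toy points vector
breaks <- quantile(pts, probs = seq(0, 1, length.out = 4))  # tertile cut points
grp <- cut(pts, breaks = breaks, include.lowest = TRUE,
           labels = c("low", "middle", "high"))
table(grp)  # 4 observations per group
```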
I chose not to follow the homework instructions strictly at this point, because manually iterating through the model selection seems pointless. One should also remember that the order in which variables are entered into the model matters, so one should not experiment with different variables at random. Luckily, our data is low-dimensional, so best subset selection is feasible. I reckoned I would let best subset selection choose the variables for me, i.e. present me with the best model with three variables (as requested in the exercise instructions). Here is how I did it:
library(leaps)
#let's remove those two additional variables
drops <- c("pgroup", "pgroup2")
learning2014_org <- learning2014[ , !(names(learning2014) %in% drops)]
#model validation by best subset selection
regfit.full <- regsubsets(points~.,learning2014_org)
reg.summary <- summary(regfit.full)
#some validation metrics
which.max(reg.summary$adjr2)
## [1] 5
which.min(reg.summary$cp )
## [1] 3
which.min(reg.summary$bic )
## [1] 1
#let's try to do just one plot
par(mfrow = c(1,1))
plot(reg.summary$cp ,xlab="Number of Variables ",ylab="Cp", type="l")
points(3,reg.summary$cp[3],col="red",cex=2,pch=20)
#here we select the best model with 3 variables
coef(regfit.full ,3)
## (Intercept) age attitude stra
## 10.89542805 -0.08821869 3.48076791 1.00371238
Apparently, choosing three variables is not such a bad idea, as the Cp validation metric recommends a three-variable model. Most importantly, we now know the variables for the best three-variable model: attitude, stra and age. Next we fit the model with those variables.
lm.fit <- lm(points ~ attitude + stra + age, data=learning2014_org)
summary(lm.fit)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014_org)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
par(mfrow = c(2,2))
plot(lm.fit, which = c(1,2,5))
In the output we can see the model fit and some important information about the model. attitude is the only statistically significant variable; the others have p-values higher than 0.05. Age has a mild negative coefficient, i.e. it decreases points slightly. A one-unit increase in stra is associated with roughly one more exam point, and a one-unit increase in attitude with almost 3.5 more points. R^2 is 0.2182 and adjusted R^2 is 0.2037, which means the model explains about one fifth of the variance of the dependent variable. Adjusted R^2 is the measure to use for comparing models, because R^2 increases automatically as more variables are included, whereas adjusted R^2 does not. The F-statistic tells us that the coefficients are not all zero.
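The adjusted R^2 quoted in the summary can be reproduced by hand from R^2, the sample size n and the number of predictors p:

```r
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.2182; n <- 166; p <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.2037, matching the summary(lm.fit) output
```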
From the regression diagnostic plots we can observe no major problems with the residuals (no signs of heteroscedasticity), and no individual points seem to have enough leverage to distort the model.
I am very excited about this Introduction to Open Data Science course, as I feel that it offers an opportunity to learn data science and statistical tools, which I have been using in my own projects. I think that the Datacamp learning environment was a good choice for the course and it is very easy to use.
Here is a link to my github repository: IODS course